ARI  TECHNICAL  REPORT 
TR-79-A11



Principles of Work Sample Testing:
IV. Generalizability

by

Robert M. Guion
and
Gail H. Ironson


BOWLING  GREEN  STATE  UNIVERSITY 
Bowling  Green,  Ohio  43403 


April  1979 


Contract  DAHC  19-77-C-0007 

Prepared for


U.S.  ARMY  RESEARCH  INSTITUTE 

for  the  BEHAVIORAL  and  SOCIAL  SCIENCES 

5001 Eisenhower Avenue

Alexandria, Virginia 22333


Approved  for  public  release;  distribution  unlimited. 



U. S. ARMY RESEARCH INSTITUTE

FOR  THE  BEHAVIORAL  AND  SOCIAL  SCIENCES 


A Field  Operating  Agency  under  the  Jurisdiction  of  the 
Deputy  Chief  of  Staff  for  Personnel 

JOSEPH ZEIDNER
Technical Director

WILLIAM L. HAUSER
Colonel, US Army
Commander


NOTICES 


DISTRIBUTION: Primary distribution of this report has been made by ARI. Please address correspondence
concerning distribution of reports to: U. S. Army Research Institute for the Behavioral and Social Sciences,
ATTN: PERI-P, 5001 Eisenhower Avenue, Alexandria, Virginia 22333.


FINAL DISPOSITION: This report may be destroyed when it is no longer needed. Please do not return it to
the U. S. Army Research Institute for the Behavioral and Social Sciences.


The findings in this report are not to be construed as an official Department of the Army position,
unless so designated by other authorized documents.


UNCLASSIFIED

SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)


REPORT DOCUMENTATION PAGE

1.  REPORT NUMBER: TR-79-A11

2.  GOVT ACCESSION NO.:

3.  RECIPIENT'S CATALOG NUMBER:

4.  TITLE (and Subtitle): PRINCIPLES OF WORK SAMPLE TESTING:
IV. GENERALIZABILITY

5.  TYPE OF REPORT & PERIOD COVERED: Final, 15 Nov 76 - 15 Jun 78

6.  PERFORMING ORG. REPORT NUMBER:

7.  AUTHOR(s): Robert M. Guion, Gail H. Ironson

8.  CONTRACT OR GRANT NUMBER(s): DAHC19-77-C-0007

9.  PERFORMING ORGANIZATION NAME AND ADDRESS:
Bowling Green State University
Bowling Green, Ohio 43403

10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS:
2Q1611[illegible]2B74F

11. CONTROLLING OFFICE NAME AND ADDRESS:
US Army Research Institute for the Behavioral and Social Sciences
5001 Eisenhower Avenue, Alexandria, VA 22333

12. REPORT DATE: April 1979

13. NUMBER OF PAGES: 29

14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office):

15. SECURITY CLASS. (of this report): Unclassified

15a. DECLASSIFICATION/DOWNGRADING SCHEDULE:

16. DISTRIBUTION STATEMENT (of this Report):
Approved for public release; distribution unlimited.

17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report):

18. SUPPLEMENTARY NOTES:
Monitored by G. Gary Boycan, Engagement Simulation Technical Area, Army
Research Institute

19. KEY WORDS (Continue on reverse side if necessary and identify by block number):
Measurement theory, psychometrics, work sample testing, validity,
content-referenced testing, criterion-referenced testing, generalizability theory

20. ABSTRACT (Continue on reverse side if necessary and identify by block number):
Three kinds of generalizability are described: generalizability of test
content, generalizability of relationships such as are described by validity
coefficients, and generalizability of test scores across varying conditions.
All are deemed important in the construction and evaluation of work sample
tests. A special problem is determining the limits of generalizability; it
is suggested that the research designs of generalizability theory can be
applied in some cases to determining the limits of the generalizability of
criterion-related validities.

This report, the last of four, is written for psychologists and others
interested in theories of testing.


PRINCIPLES  OF  WORK  SAMPLE  TESTING:  IV.  GENERALIZABILITY 


BRIEF 


Generalizability has been identified as a major concern in the
evaluation of work sample tests. Generalizability is also an ambiguous
term that has had substantially different meanings in different
contexts. It sometimes refers to generalizability of test content to a
broader test domain. It has often been used in the context of
validity generalization, a special case of the generalizability of
relationships. More recently it has frequently been used to refer to
generalizability theory, an approach to the generalizability or
dependability of scores.

An example of systematic effort to assure generalizability of
content is cited from the development of a model tank gunnery test
(Wheaton, Fingerman, & Boycan, 1977); this work is cited particularly
for the development of an index of generalizability. Its applicability
is particularly great for direct work samples. For abstracted work
samples, special attention should also be given to the importance of
specific task variables and to matching these variables in the
construction of the abstracted sample.

Generalizability of relationships, as in the testing of a
criterion-related hypothesis, is understood best in the example of the
generalizability of criterion-related validities, but there are other examples.
It is a criterion-related hypothesis that needs to be tested and,
if supported, evaluated for its generalizability when a literal
measurement of one variable is used for inferring measurement of a
different variable; this is usually done under the rubric of construct
validity. A similar example is the abstracted work sample, in which
the literal performance of the unreal task is used for inferring
performance on the real one. Studies of generalizability of
relationships need to pay particular attention to the limits of such
generalizability.

Generalizability theory, as proposed by Cronbach, Gleser, Nanda,
and Rajaratnam (1972), refers to the generalizability of scores.
Several designs for such studies are described. A general conclusion
is reached that the limits of validity generalization (i.e., of the
generalizability of relationships) might be studied by the
generalizability theory research designs.

INTRODUCTION


Each of the preceding reports of this series has concluded that
a major consideration in the evaluation of a work sample test is its
generalizability. Unfortunately, the word is so ambiguous that the
conclusion can mean different things to different readers. Consider
the following list of statements drawn primarily from the other three
reports:


1.  In an experiment for evaluating programs or materiel, the
sample of performance used as the dependent variable of the
experiment should be representative of (that is, should
generalize to) the performance in a target situation.

2.  The properties of the tasks included in a work sample should
be representative of the properties of the tasks in the
domain from which they are sampled; for example, if, in the
job content domain the job is structured so that each batch
of 20 completed assemblies is a recorded unit, then the work
sample should be based on a similar work unit. To the extent
it does not, performance on the work sample may not generalize
to performance on the real job.

3.  The first step in the evaluation of a work sample is to
evaluate the degree to which it is congruent with the domain being
sampled; are the salient task characteristics adequately
sampled in appropriate proportions?

4.  The simple, abstracted work sample is best evaluated in terms
of how well performance on the abstraction can be used to
infer performance on a more highly complex real job.
Performance on the abstract portion must generalize to performance
on the whole.

5.  In measurement in institutional settings, the actual
measurements obtained usually are of less concern than are attributes
to be inferred or predicted from them; the evaluation of
measurement under conditions of institutional control is
based on the degree to which the scores will generalize to
attributes of greater institutional concern.

6.  Measurement of one attribute is often inferred from literal
measurement of quite a different attribute, such as the use
of reaction time to infer some aspect of information
processing. The usefulness of such an arrangement depends on how
well performance on the one attribute generalizes to
performance on the one inferred.

7.  The criterion-related validity obtained in one setting may
generalize to a different setting if the characteristics of
the subject populations, tasks, and settings are essentially
similar.

8.  A disadvantage of tight, experimental control of measurement
in a laboratory situation is that the situation may differ so
substantially from the target situation that measurement in
the one will not generalize to the other; where the situations
differ markedly, "generalizability must be assessed."

9.  The evaluation of work samples in field settings is incomplete
unless it can be shown how well performance in the field setting
used will generalize to performance in some targeted setting.

10.  The variables that describe a task (autonomy, demand for
unwavering attention or other skills, amount of feedback, and
others) may change systematically as characteristics of the
situations in which performance is measured change; as task
variables change, generalizability may be reduced.


All of these statements are special instances of the broader
statement that, regardless of purpose, setting, or combination of variables
and methods of measurement, generalizability is the universally required
characteristic of effective measurement. They do not, however, merely
say the same thing in ten different ways. The first three statements
refer to the generalizability of the content of the measurement; they
deal broadly with the generalizability of the actual content of the
content sample chosen to the rest of the domain that might have been
chosen. The next four statements all deal with the generalizability
of a relationship; they ask about the relationship of the variable
measured to a variable one may perhaps wish had been measured, and,
in statement 7, about the generalizability of such a relationship
across settings. The last three statements all refer explicitly to
the generalizability of scores, or of inferences from scores, across
different conditions of measurement.

In  almost  every  one  of  the  statements  in  that  list,  there  is  a 
sense  in  which  all  three  kinds  of  generalizability  may  be  relevant; 
nevertheless,  these  three  can  be  examined  independently  both  for 
the  concepts  they  represent  and  for  the  possible  methods  available 
for  evaluating  them. 

GENERALIZABILITY  OF  CONTENT 

Every sampling is intended to generalize to the whole from which
it is a sample. Random sampling in experimental work is intended to
assure generalizability of results to the population sampled;
stratified sampling is the effort of people whose faith in random sampling
is small to assure such generalizability.

So it is with content sampling in the development of a work sample.
Each stage in the process is a sampling exercise, not necessarily
representative, which defines the part or aspect of the previous whole
to which conclusions of some sort are expected to generalize. Job
content domains are intended to generalize to certain aspects of job
content universes; so it is also with test content domains and universes,
and a test content domain is expected to have potential for generalizing
to selected aspects of a job content domain. And the sample of the
test content domain ultimately chosen is certainly expected to
generalize to the test content domain; that is what so-called content validity
is about. The problem is the choice of procedures by which one may
feel comfortable in making generalizations from the sample to the
whole.

A CASE STUDY IN ITEM SAMPLING


Wheaton, Fingerman, and Boycan (1978) dealt explicitly with the
generalizability problem in item sampling and developed an index of
generalizability to be applied to "items" of a work sample. The work
was that of a tank crew in "neutralizing" a target; the items were
job objectives (of which they identified 266). Each job objective
consisted of some subset of a possible 114 behavioral elements. By
cluster analysis of a matrix of the 266 objectives and 114 elements,
fourteen empirical and two rational clusters of objectives, or
families, were identified. (In the third report of this series, the
steps recommended for job analysis follow a similar pattern; to
compare the two reports, job objectives can be equated with the component
tasks of report III; behavioral elements with task elements; and families
with task categories.)

The procedure followed was related by Wheaton et al. to the
classical procedures of item analysis; they argued (a) that classical
item analysis produced a set of items for a final test that would
generalize to the whole and (b) that classical item analysis was too
expensive to duplicate in a tank gunnery test. The present report
will not follow the first of these arguments; what they did seems
preferable to the more expensive item analysis. Empirical item
analysis techniques will succeed in selecting a subset of all original
items that will maximize the overall variance in that set. That is,
a skillfully conducted item analysis, with certain sets of data, will
result in the elimination of those items which have a low correlation
with the composite of all other items, and the resulting test will
have a variance comparable to that of any other similar-size subset
of items and perhaps even a reliability as good as the total set.
Items eliminated, however, may be eliminated only because they
are low in variance in the particular sample in which the item analysis
is done. Alternatively, the items may be eliminated because they have
low common variance, although substantial variance of their own, and
therefore do not contribute to the internal consistency of the final
test.


The argument of these reports is that high levels of internal
consistency are not necessary in work sample tests, even though
subscores on a work sample should have enough internal consistency to
demonstrate some degree of functional unity. If this argument is
accepted, it follows that an item need not be eliminated from the
test (or subtest) unless it actually shows a zero or negative
correlation with the rest of the test. Thus an empirical item analysis
might have eliminated some items identified as important by the
procedures used by Wheaton et al., items which helped assure that
the final test was indeed representative of the behaviors actually
required on the job.
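
The difference between the two elimination rules can be illustrated with a small sketch. It is not the procedure of Wheaton et al. or of any particular item analysis program; the score matrix, the function names, and the cutoff of .60 standing in for a conventional "low correlation" standard are all hypothetical. Each item is correlated with the sum of the remaining items, and the items dropped under the conventional cutoff are compared with those dropped only for a zero or negative correlation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def item_rest_correlations(scores):
    """scores[p][i] is examinee p's score on item i.

    Returns each item's correlation with the sum of all other items.
    """
    n_items = len(scores[0])
    correlations = []
    for i in range(n_items):
        item = [row[i] for row in scores]
        rest = [sum(row) - row[i] for row in scores]
        correlations.append(pearson(item, rest))
    return correlations


def items_to_drop(correlations, cutoff):
    """Indices of items whose item-rest correlation is at or below the cutoff."""
    return [i for i, r in enumerate(correlations) if r <= cutoff]


# Hypothetical pass/fail scores: 8 examinees (rows) by 5 items (columns).
scores = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]

correlations = item_rest_correlations(scores)
conventional_drop = items_to_drop(correlations, cutoff=0.60)  # assumed cutoff
lenient_drop = items_to_drop(correlations, cutoff=0.0)        # zero or negative only
```

With these made-up scores the fourth item correlates .50 with the rest of the test; the conventional rule discards it while the more lenient rule retains it, and only the fifth item, which runs counter to the others, is dropped under both rules.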

The rational foundation for the approach is the assumption that
the more behavioral elements that one job objective has in common with
other job objectives, the more generalizable is the performance on it
relative to the others. Arbitrarily but correctly, they believed that
each family of job objectives should be represented in the final work
sample according to the number of objectives it contained; that is,
a family or cluster of objectives with 30 objectives in it should be
represented three times as heavily as a ten-objective cluster. The
issue was to choose the one or the three objectives as items of the
work sample so as to maximize the generality of the resulting test.
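
The proportional rule can be made concrete with a short sketch. The family names, family sizes, and test length below are invented for illustration, and the handling of fractional quotas (largest remainders first) is an assumption of the sketch, not something specified by Wheaton et al.

```python
def allocate_items(family_sizes, n_items):
    """Split n_items among families in proportion to their number of objectives.

    family_sizes maps a family name to its count of job objectives.
    Fractional quotas are rounded down, and any items left over go to the
    families with the largest fractional remainders (an assumed tie rule).
    """
    total = sum(family_sizes.values())
    exact = {name: n_items * size / total for name, size in family_sizes.items()}
    allocation = {name: int(quota) for name, quota in exact.items()}
    leftover = n_items - sum(allocation.values())
    by_remainder = sorted(
        exact, key=lambda name: exact[name] - allocation[name], reverse=True
    )
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical families: one of 30 objectives, one of 10, one of 20.
families = {"A": 30, "B": 10, "C": 20}
allocation = allocate_items(families, 6)
```

In this example the 30-objective family receives three items and the ten-objective family one, matching the three-to-one weighting described above.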

Several methods were considered and rejected. For example, one
might take the objectives with the largest number of behavioral
elements. This was rejected because many of the elements might occur
only in that one objective, that is, they might not generalize to the
family of objectives.

Two methods were considered intuitively sound and were ultimately
combined. Each of them is a method of obtaining a weighted sum of
the behavioral elements included in each job objective. One method of
weighting assigns a weight equal to the frequency with which element i
occurs within a family, that is, $F_i$. This assigns the greatest weight
to the most commonly occurring elements. The second method is to
assign a weight of $F_i / D_i$, where $D_i$ is the frequency with which the
element occurs in the total domain; this method would give the
greatest weight to an element which occurs almost exclusively within
the one family, even though it may not appear very often. The two
methods were combined by multiplying them so that the weight
recommended by Wheaton et al. is $\sum_i (F_i^2 / D_i)$, called the index of
generalizability, for each job objective. Although their decisions
were based in part on practical issues of cost as well as on these
indices, a relatively inexpensive content domain could be sampled
largely empirically through the use of such an index.
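
The index can be sketched in a few lines under stated assumptions. Here each job objective is represented simply as a set of behavioral elements; the objective and element names are hypothetical and are not taken from the tank gunnery domain. The within-family frequency F of an element is counted as the number of objectives in the family that contain it, and the total-domain frequency D as the number of objectives in the whole domain that contain it, which is one reading of "frequency" in the description above.

```python
from collections import Counter


def element_counts(objectives):
    """Count, for each element, how many of the given objectives contain it."""
    counts = Counter()
    for elements in objectives.values():
        counts.update(elements)
    return counts


def generalizability_indices(domain, family_names):
    """Index for each objective in one family: sum over its elements of F**2 / D.

    domain maps every objective name in the domain to its set of elements;
    family_names lists the objectives that belong to the family of interest.
    """
    family = {name: domain[name] for name in family_names}
    d = element_counts(domain)   # D: frequency in the total domain
    f = element_counts(family)   # F: frequency within the family
    return {
        name: sum(f[e] ** 2 / d[e] for e in elements)
        for name, elements in family.items()
    }


# Hypothetical domain of five objectives; the first four form one family.
domain = {
    "obj1": {"aim", "fire", "load"},
    "obj2": {"aim", "fire"},
    "obj3": {"aim", "fire"},
    "obj4": {"aim", "smoke", "radio", "range"},
    "obj5": {"load", "report"},
}
indices = generalizability_indices(domain, ["obj1", "obj2", "obj3", "obj4"])
best = max(indices, key=indices.get)
```

In this example obj4 contains the most elements, but three of them occur nowhere else in the family, so its index (7.0) falls below that of obj1 (7.5), which is the point made in rejecting a simple count of elements.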

ABSTRACTED  TESTS 

The item sampling reported by Wheaton et al. is an example of
the direct work sample; that is, the job objectives chosen were
component parts of the actual jobs selected because they were more widely
representative than others. In the abstracted work sample, the
"items" (or at least the composites which the set of items produce)
may be created rather than selected by sampling. The abstracted work
sample identifies certain components (task elements) and puts them
together in such a way as to represent, if not to resemble, the
essential or crucial elements of the job itself. Again, the purpose
is to select a set of final tasks that will generalize to the whole
of the original content domain, even if less intuitively obviously.

The task elements might be chosen by a procedure not unlike that
for weighting elements in the Wheaton et al. example. An alternative
is to have a panel of qualified judges (i.e., qualified because of
their knowledge of the job) rate each element both for frequency and
importance. Although these ratings will probably be highly
correlated, they are both necessary if the task elements that occur rarely
but are crucial when they occur are to be identified (or, conversely,
if there is to be any success in identifying the task element that
is important because it is pervasive). Some combination of these
ratings can be used as the basis for selecting the task elements to
be abstracted that will generalize most handily to the domain, at
least as the term is used by Wheaton et al. There remains, however,
the problem of putting these elements together into a task so that
performance on the artificial, abstracted task will generalize to
performance on the real job.

At  the  outset,  it  should  be  clear  that  this  poses  an  empirical, 
not  simply  a rational  problem.  If  little  abstraction  is  involved, 
there  may  be  no  reasonable  basis  to  question  the  relevance  of  the 
abstraction  to  the  job  as  a whole;  if  the  abstraction  is  severe, 
however,  the  relationship  between  performance  on  it  and  performance 
on  the  job  itself  is  subject  to  investigation  by  empirical  study. 

Some  things  can  be  done  in  the  test  development  stage,  however, 
that  will  either  minimize  the  need  to  do  the  empirical  study  or 
maximize  its  probability  of  success.  These  are  things  that  maximize 
the basic similarities between the real tasks and the artificial
abstracted  tasks. 

The first report of this series, the taxonomy of testing,
reported nine possible categories of task variables. These included:

1.  Duration  or  intensity  of  attention 

2.  Hazards 

3.  Degree  of  task  structure 

4.  Organizational  involvement 

5.  Task complexity

6.  Intrinsic  feedback 

7.  Skill  demands 

8.  Significance 

9.  Autonomy 

It would appear intuitively that attempts to match the two tasks
on the most salient of these kinds of variables would assure
generalizability of the abstraction to the domain.


INDUSTRY-WIDE  RESEARCH 

There seems to be growing concern for developing research
programs to validate employment tests on an industry-wide basis; the
banking industry, the insurance industry, and the electric power
industry have all initiated such research. In such studies, the
purpose is to develop tests that will generalize across various
organizations which, independently, make up the industry. In part, this
discussion belongs under the next major heading, the generalizability
of relationships, since the major target is to develop
criterion-related validity statements that will generalize from one company to
another. However, before that portion of the objective can be
reached, it is first necessary to define the job domains to be tested.
Different jobs in different organizations may involve the same total
collection of component tasks put together in different ways. The
problem is to identify a subset of component tasks which will
themselves have generality, which is to say, will generalize, across
organizations.


GENERALIZABILITY OF RELATIONSHIPS


Every criterion-related hypothesis is an example of a
generalization; the hypothesis, if supported, means that inferences about a
criterion can be drawn (or generalized) from performance on a
predictor. If the hypothesis is supported often enough, and most of the
time, the validity of the hypothesis is itself generalized to
situations beyond that in which it was initially supported.

THE CRITERION-RELATED HYPOTHESIS

The most commonly proposed hypothesis in personnel testing is
the predictive hypothesis that performance on an employment test can
be used in a functional equation to predict performance on the job.
Variations on the theme include the hypothesis that a work sample can
be used, perhaps in something like a discriminant function, to
classify examinees correctly into master and non-master categories
defined on some basis other than the test. In such examples, it is
clear that the predictor, the test score variable, and the criterion,
a job performance variable, are measuring quite different classes of
attributes. The fact of different variables is what sets the
investigation of the criterion-related hypothesis apart from investigations
of test validity which inquire into the excellence of the test as a
measure of some variable.

The list of statements at the outset in this report refers to the
conventional criterion-related hypothesis in statement 5; perhaps this
statement describes the essential criterion-related hypothesis for
personnel testing: the hypothesis that the ordering of people
according to their test scores can generalize to their order according to a
variable "of greater institutional concern." The essence of the
criterion is that it is the variable of greater organizational
interest. Whenever scores on a test are used not as descriptions of
test performance but as indicators of something else that is more
important, a criterion-related hypothesis is at least implied. Two
kinds of implied hypotheses need to be identified: the hypothesis
implied by using one variable as a presumed measure of a different
variable, and the specific example of that presumption in work sample
testing where an abstraction of the job is a presumed measure of
performance on the real thing.

Inferences from Literal Measures. Psychological measurement
typically measures one thing for the purposes of inferring something
of greater interest. The number of arithmetic problems correctly
solved in a specified time period is taken as a sign of an underlying
construct called numerical fluency. Correct identification of where
small areas on a flat surface would be found if the surface were
folded into three-dimensional objects is used at one level as a
measure of an underlying ability called visualization and, at another,
of a construct called mechanical aptitude. The example used earlier
is of measuring the electrical resistance on the surface of the skin
and, as it changes, inferring changes in emotionality.

These are criterion-related hypotheses, but they are not among
the examples of criterion-related validation in conventional approaches
to personnel testing. Rather, the evaluation of these hypotheses
proceeds along the lines known as construct validation; if a
substantial body of research, consisting largely of concurrent
criterion-related validity studies testing these hypotheses, supports these
presumptions, then the literal measure is said to be valid for
measuring (or, more properly, inferring) the more interesting measure.
In a dictionary sense of the term, such an inference is a
generalization from the narrowly defined literal measure to the broader meaning
of the variable of greater interest. Such generalization nearly
always requires some form of empirical support, although it is not
always known as construct validity. In experimental psychology,
particular measures are generalized because of the weight of the
literature using them. If a set of operations for defining a
construct becomes widely enough accepted, the issue of construct
validity in its psychometric sense is not raised. Thus, if there
is enough acceptance of performance on a specific set of tasks, none
of which is genuinely drawn from a job, as an indication of
performance on a job these tasks were designed to reflect, the construct
validity of the interpretation of the performance as a general
ability is not questioned.

Abstracted Work Samples. That is to say, the abstraction that
occurs in the development of a work sample test may be accepted without
raising questions of construct validity. The requirements are that
informed people understand the nature of the abstraction and its
relevance to the actual job or that a history of its use has
demonstrated empirically that relevance. The latter is unlikely; unlike
experiments in basic psychology, the development of an abstracted
work sample is, unless the level of abstraction is extreme, an ad hoc
affair.


The nature of the abstraction and of its relevance may be
self-evident to informed people if the degree of abstraction is not great.
For example, it is unlikely that any informed and reasonable critic
would object to a work sample test for cabinet makers that resulted in
no useful product but did manage to incorporate nearly all of the most
difficult kinds of joining, routing, and other skills. Real questions
could be expected, however, if a test of steadiness in running a piece
through the same motions used in rabbeting were used in place of
cutting an actual rabbet. Where these questions arise, the developer
of the test has, knowingly or not, proposed a criterion-related
hypothesis in which the criterion is actual job performance. The purpose
of the test is to generalize to inferences about the construct of job
skill. Its construct validity may be supported in part by the logic
of the content sampling leading to the abstraction, but it is better
supported by empirical evidence that the generalization can
legitimately be made.

VALIDITY  GENERALIZATION 

Assuming that the criterion-related generalization has been
empirically supported, the next question that arises is whether that
support holds only in specific settings. This is the question of
validity generalization. Lawshe (1952) spoke of generalizable
criterion-related validities as being those, like typing tests, which
had found so much support that situational validity studies were not
needed.

Conventional wisdom has long asserted that criterion-related
validities need to be determined independently in each situation of
application. The wide spread of obtained validity coefficients, such
as those described by Ghiselli (1966), was taken as evidence of the
need to determine for each situation the validity of the hypothesis
in that situation since validities vary so widely. The present writer
has noticed an embarrassing example of the hypocrisy of the demand
for situational validation. In one paragraph (Guion, 1965, p. 455),
it was pointed out that personnel testers have no scientific
generalizations on which to depend. The next paragraph identified as
the first of three "serious flaws" in selection research the tendency
of personnel testers to stick with the "safe" tests, those for which
validity had been repeatedly demonstrated.

The  Schmidt  and  Hunter  Study.  Schmidt  and  Hunter  (1977)  noted 
that  many  factors  may  be  responsible  for  obtaining  different  validity 
coefficients in specific samples drawn from a common population. If,
for example, the population consists of all people who work at a
specific job tested by a specific type of test, then there is a true
population validity coefficient. Obtained validity coefficients from
finite (and usually rather small) samples will be distributed around
that value. By making two assumptions about criterion reliability and
about degree of restriction of range in these samples (assumptions
which seem not to be unreasonable), Schmidt and Hunter determined the
probable standard deviation of the sampling distribution. Their
conclusion was that the variability of validity coefficients reported
in various reviews can be almost entirely explained in terms of these
two statistical characteristics of the samples used. When one adds
further artifactual influences on obtained validity coefficients, such
as  differences  in  factor  structure  of  individual  tests  within  the  test 
type or of criterion measures in individual studies, the room for
real  situational  variables  accounting  for  differences  in  obtained 
validity  coefficients  is  slim  indeed. 
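The comparison described above can be illustrated with a minimal sketch. It is not Schmidt and Hunter's (1977) procedure, which also rests on assumptions about criterion reliability and restriction of range; it uses only the familiar large-sample approximation for the sampling variance of a correlation, (1 - r^2)^2 / (N - 1). The function names and any coefficients or sample sizes passed to them are hypothetical.

```python
from statistics import mean


def weighted_mean_r(rs, ns):
    """Sample-size-weighted mean of the observed validity coefficients."""
    return sum(r * n for r, n in zip(rs, ns)) / sum(ns)


def observed_variance(rs, ns):
    """Sample-size-weighted variance of the observed coefficients."""
    r_bar = weighted_mean_r(rs, ns)
    return sum(n * (r - r_bar) ** 2 for r, n in zip(rs, ns)) / sum(ns)


def expected_sampling_variance(rs, ns):
    """Variance expected from sampling error alone (large-sample approximation)."""
    r_bar = weighted_mean_r(rs, ns)
    n_bar = mean(ns)
    return (1 - r_bar ** 2) ** 2 / (n_bar - 1)


def proportion_explained(rs, ns):
    """Share of observed variance attributable to sampling error, capped at 1."""
    obs = observed_variance(rs, ns)
    if obs == 0:
        return 1.0
    return min(expected_sampling_variance(rs, ns) / obs, 1.0)


if __name__ == "__main__":
    # Hypothetical coefficients from four small studies of 50 people each.
    rs = [0.10, 0.30, 0.20, 0.40]
    ns = [50, 50, 50, 50]
    print(proportion_explained(rs, ns))
```

A proportion near 1 leaves little room for situational influences; a proportion well below 1 suggests that something other than sampling error is contributing to the spread.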

Limits of Generalizability. The good news according to Schmidt
and Hunter should not be prematurely embraced as evidence of validity
generalization as the invariant rule. The logic of much of the field
of industrial and organizational psychology argues against such an
interpretation. Attempts to improve training will, for example, vary
widely from one organization to another so that the effectiveness of
job training will account for different portions of criterion variance.
Organizational development programs, human relations training for
supervisors, and attempts to provide or manipulate reward systems are
all psychological attempts to account for substantial portions of
performance variables, and the existence and success of such attempts
will vary widely. Nonpsychological influences on performance should
account for varying portions of the criterion variance; equipment in some
organizations  is  the  most  advanced  and  sophisticated  available;  in 
others,  it  is  said  to  be  held  together  with  chewing  gum  and  baling 
wire.  It  is  not  reasonable  to  assume  that  the  same  "true"  validity 
coefficient  applies  to  the  organization  with  the  best  leadership, 
training,  and  equipment  as  to  the  organization  where  no  advances  have 
been  made  in  these  fields. 

Yet  the  argument  of  the  Schmidt  and  Hunter  analysis  is  persuasive; 
evidence  can  be  found  of  the  generalizability  of  observed  relationships 
between  predictors  and  the  criteria  they  predict. 

The resolution of the apparent contradiction seems obvious enough;
the question is not whether relationships generalize beyond the
situation in which they are first observed, but rather how far, and
within what limits, that generalizability may be assumed.

This question has not been addressed in the literature reviewed
to date. Work in progress under a Fellowship at Bowling Green State
University may point to one approach to it. It is currently concerned
with the generalizability of various relationships at the management
level. As preliminary research, three taxonomies are being developed:
(a) a taxonomy of management occupations, (b) a taxonomy of
management styles, and (c) a taxonomy of situations. As the work
progresses, it will try out certain relationships rather well
established in the research literature in each of several of the cells
of the resulting three-dimensional matrix. If variability is (a)
random and (b) explainable in terms of artifacts such as those
described by Schmidt and Hunter, then generalizability across the
domain may be assumed. If variability is (a) random and (b) greater
than can be explained by
statistical  artifacts,  then  influences  on  validity  exist  which  have  not 
been  included  in  this  model.  If  variability  is,  however,  substantial 
and  fitting  a pattern,  then  the  three-dimensional  model  may  be  useful 
in  identifying  the  reasonable  limits  within  which  the  relationship 
may  be  said  to  generalize. 

The taxonomic approach has serious flaws, not the least of which
is  the  problem  of  arranging  the  cells  for  identifying  patterns  of 
obtained  coefficients.  It  is,  however,  one  approach  to  the  search 
for  limits  of  generalizability,  and  others  need  to  be  generated. 

GENERALIZABILITY  OF  SCORES 

The  most  completely  described  approach  to  generalizability  is 
that of Cronbach, Gleser, Nanda, and Rajaratnam (1972), applying
analysis of variance research designs to the study of the dependability
of scores. Although the method and theory so far promulgated refer only
to the generalizability of scores, they may have implications for
modifications that can help identify the limits of validity
generalization as well.

Generalizability theory extends classical psychometric theory in
several ways (Brennan, 1977; Cronbach et al., 1972). The most
important of these is that generalizability theory recognizes distinct
sources of error in any measurement, in contrast to a single,
undifferentiated source of error. In classical test theory, reliability
is defined as the proportion of observed score variance attributable to
"true" score variance (that is, not random error variance).
Generalizability theory replaces the reliability coefficient with the
coefficient of generalizability, the "true" score with the more
precisely delimited universe score, and random error variance in
general with specific sources of error variance.

The definition of a universe depends on how the investigator
plans to interpret the measure. Cronbach et al. (1972) identify three
possible kinds of decisions: classification of people on the basis of
scores  into  two  or  more  categories,  comparisons  of  possible  courses 
of  action  for  the  same  persons,  or  normative  comparisons  of  people. 

The  defined  universe  of  generalization  depends  pretty  much  on  the 
kinds  of  decisions.  For  some  of  these  it  may  be  the  entire  universe 
of  admissible  observations.  For  others,  it  may  be  a subset.  A 
third  possibility,  that  it  may  be  something  outside  of  that  universe, 
has  been  rejected  by  Cronbach  and  his  associates  as  either  referring 
to  validity  or  to  an  overgeneralization. 

The universe score is the expected value of the observed score
for a person over the universe of items. The coefficient of
generalizability is the ratio of universe score variance to expected
observed score variance. Clearly, there may be several coefficients of
generalizability; each makes explicit the domain to which it is
applicable.
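In symbols, the two definitions just given can be written as follows. The notation is an assumption, since the quantities are defined here only in words: X_{pi} is the observed score of person p on item i, \mu_p is that person's universe score, and \mathcal{E}\rho^2 is the coefficient of generalizability, following common usage in the generalizability literature.

```latex
\mu_p = \mathcal{E}_i \, X_{pi},
\qquad
\mathcal{E}\rho^2 = \frac{\sigma^2(\mu_p)}{\mathcal{E}\,\sigma^2(X)}
```

Changing the universe over which the expectation is taken changes \mu_p and therefore the coefficient, which is why more than one coefficient may exist for the same measure.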

Perhaps the major advantage of generalizability theory is that
it makes the investigator ask questions which might otherwise not be
considered (Cronbach, 1976). That is, the investigator must make an
explicit  choice  of  the  nature  of  the  universe  to  which  scores  are 
expected  to  generalize.  The  system  then  dictates,  at  least  in  large 
measure,  the  plan  or  design  for  data  collection  and  analysis.  The 
proper  design  depends  on  the  nature  of  the  universe  chosen. 

Replacing "true score" with "universe score" emphasizes that the
investigator is making inferences from a sample of possible
observations; the choice of universe emphasizes that there is more than
one universe to which an investigator might wish to generalize. To be
more explicit, if a person is tested on a work sample, he may himself
be considered as belonging to a population of males, Californians,
whites, or others; he may be tested in a population of circumstances,
involving different samples of tasks, or time of day, or level of
environmental hostility, or by different observers. Thus, in
generalizability theory, any measurement of an attribute is considered
to be a sample from some large set or universe of possible
measurements, the universe of admissible observations, defined by
observations in all possible combinations of circumstances.

Circumstances of the same kind are called facets. ("Facets" is
chosen in preference to "factors," a more common term in analysis
of variance, to avoid confusion among testers with the factors of
factor analysis research.) Within the universe of admissible
observations, each facet consists of two or more categories or
conditions. Facets in the example above might include observers or task
difficulty levels or test environments; still another might be race of
the person. A decision maker may wish to generalize the results of
measurement to only a limited portion of the overall universe defined
by these facets; that portion is the universe of generalization. For
example, a study of supervisory ratings might (conceivably) include as
facets time of day of rating, nature of assignments given to the
subordinate who is to be rated, race of either supervisor or
subordinate or both, and height of supervisor. For particular problems,
a decision maker might wish to generalize only across assignments and
race and ignore height or time of day.

In designing a study, all facets that might influence scores
should be identified. Cronbach et al. (1972) distinguished between a
generalizability study (a G study) and a decision study (a D study).
The G study is designed to estimate components of variance
attributable to different facets among all those that might account for
substantial portions of variance; the G study therefore defines the
universe of admissible observations broadly. A variety of D studies
might follow from this broad definition, each of which uses whatever
information is relevant to the decision maker. The distinction
between the two types of studies is conventional but not very useful
in this discussion of work sample testing. All work sample testing
is done for the purpose of making a decision (even, if the concept of
decision is broad enough, when the test is to be used as the criterion
in the criterion-related validity study of some other predictor). The
facets that influence work sample scores are those that indeed
influence the decisions. If, for example, an abstracted work sample does
not generalize from training center to field conditions, but a direct
work sample does, then the decision in determining a candidate's
qualifications for the field job will be based on the latter. The
discussion that follows assumes that the universe of generalizability
for general evaluation of the measurement is not appreciably different
from that in making decisions either about individuals or about tests,
and the distinction will not be maintained.


SOURCES  OF  ERROR  IN  WORK  SAMPLES 

Glaser and Klaus (1962) discussed six sources of error
influencing the reliability of performance measures:

1. The test environment varies; a driver's test for licensing,
for example, may be given during good weather, rain, or snow.

2. System instability may cause unreliability; fluctuations in
the condition of equipment (for example, wet or dry brakes)
or in the behavior of other people in the system (for example,
other drivers on the same freeway) influence performance.

3. The equipment may be different; in the above example,
different people drive different automobiles of different ages and
with different systems for shifting gears.

4. The sampling of tasks may vary. The tasks in a driver's
examination in a major metropolitan area are different from
those in a rural county. In one state, candidates may choose
between parallel parking and driving a zig-zag obstacle course.

5. Dimensions describing the complexity of the behavior required
by the task may vary; they should be consistent.

6. Examinees' reactions vary from time to time. Part of this is
pure random variation, but part of it may be due to differences
in conditions at different times, for example, when the examinee
is closely observed or when the examiner is of a different race.


Since work sample tests are often scored by the judge's evaluation,
a single reliability estimate is not adequate. At the very least, one
needs to evaluate the reliability of the observer and the overall test
reliability (Lang, 1978). A major advantage of generalizability theory
is that one can design a study to estimate the contribution to error
that each of these different sources makes and, moreover, to find a
generalizability coefficient which specifically accounts for whichever
source of error is of particular interest. For example, one can find
a coefficient of generalizability which describes whether the
evaluation of performance on a work sample generalizes over different
observers, over different conditions of testing, or over different
specific items in the work sample. In addition, a generalizability
study would help an investigator decide how many judges are necessary
to get dependable evaluations, or how many conditions of testing might
be necessary to obtain a satisfactory generalizability coefficient.

DESIGNS 

Consider a simple design. A sample of items is presented to a sample
of people. In this design, using analysis of variance terminology,
items and people are crossed. This is represented by the Venn diagram
in Figure 1. The classical estimate of error, $\sigma^2(\delta)$,
includes only the variance due to the center section of the diagram,
that is, the portion of the total variance due to the interaction of
persons and items. Generalizability theory introduces a different error
term, $\sigma^2(\Delta)$, which incorporates the additional error due
to item sampling (see Cronbach et al., 1972, p. 24). Here the person is
the object of the measurement and we may wish to generalize over items.
Brennan (1977) presented formulas for the computations and discussed
the design in greater depth. An important distinction which he made
related to the appropriateness of each of these in content-referenced
tests. He pointed out that, whereas $\sigma^2(\delta)$ is a measure of
the variance of relative error, $\sigma^2(\Delta)$ is a measure of the
variance of absolute error and is therefore much more appropriate for
use in mastery testing and other forms of content-referenced
measurement.
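A minimal sketch of the computation for this crossed persons-by-items design follows. It uses the standard expected-mean-square estimators for a two-way random-effects layout with one observation per cell. The function names and any scores passed to them are illustrative assumptions; the code is not a transcription of the formulas in Brennan (1977).

```python
from statistics import mean


def variance_components(scores):
    """Estimate variance components for a crossed persons x items design.

    `scores` holds one row per person and one column per item.
    Returns (var_p, var_i, var_res); negative estimates are set to zero.
    """
    n_p, n_i = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    person_means = [mean(row) for row in scores]
    item_means = [mean(row[j] for row in scores) for j in range(n_i)]

    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    var_res = max(ms_res, 0.0)                  # person x item interaction plus error
    var_p = max((ms_p - ms_res) / n_i, 0.0)     # universe score variance
    var_i = max((ms_i - ms_res) / n_p, 0.0)     # item difficulty variance
    return var_p, var_i, var_res


def error_variances(var_i, var_res, n_items):
    """Return (relative, absolute) error variance for a mean over n_items items."""
    relative = var_res / n_items                # sigma^2(delta)
    absolute = (var_i + var_res) / n_items      # sigma^2(Delta)
    return relative, absolute


def coefficient(var_p, error_variance):
    """Ratio of universe score variance to itself plus the chosen error variance."""
    return var_p / (var_p + error_variance)


if __name__ == "__main__":
    # Hypothetical scores: four persons (rows) on three items (columns).
    scores = [[4, 5, 3], [2, 3, 1], [5, 5, 4], [3, 4, 2]]
    var_p, var_i, var_res = variance_components(scores)
    rel, ab = error_variances(var_i, var_res, n_items=3)
    print(coefficient(var_p, rel), coefficient(var_p, ab))
```

Because the absolute error variance adds the item component to the relative error variance, the coefficient based on it can never exceed the one based on relative error. This is the sense in which item sampling matters for content-referenced interpretations.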

Dependability of Instructor Ratings. In this section, two more
complicated designs which have been used in evaluating student ratings
of instruction are discussed in the context of peer ratings or
observer ratings used in evaluating work samples.

The key issue in a study reported by Kane, Gillmore, and Crooks
(1976) was the dependability of the mean rating of an instructor over
a class of students and a set of items. The design was basically a
split plot design; students were nested within classes and crossed
with items, as illustrated in Figure 2. Figure 2 identifies five sources
of variance: class, students within the class (confounding the student
mean effect and the student-by-class interaction), items, interaction
of items and class, and the residual (which includes the
student-by-item interaction, the student-by-class-by-item interaction,
and error). These sources of variance essentially define the nature of
a generalizability study.

This may be placed in a context of peer ratings of trainee
performance in a sample of work. The sample of work is done in different
settings. Consider it to be a total job sample observed for a period
of two days under the different conditions. The conditions might be
night and day. After the two-day period, those peers who have had
contact with the trainee might rate his performance on a set of rating
scale items. The design for analysis is judges or peers nested within
condition and crossed with items. We are interested in the
dependability of peer ratings of the trainees. The choice of which of
three generalizability coefficients is appropriate depends on whether
one wishes to generalize over judges and items or only over judges or
only over items. Explicit formulas for computing these generalizability
coefficients are given by Kane et al. (1976).

A study such as this provides not only the three estimates of
generalizability but also enough information to determine the number
of judges and the number of items needed for the desired size of the
generalizability coefficient. In the example involving student
evaluation of teaching, the authors recommended using 15 or more
students and four or more items for what they considered to be a
satisfactory generalizability coefficient over both students and items,
roughly .70. Moreover, it was found that increasing the number of
students had a greater impact on the generalizability coefficient than
increasing the number of items. In the analogy of the peer ratings of
trainees, this would mean that increasing the number of peers doing the
rating would have a greater effect than increasing the number of items
on which the rating is done. This seems intuitively logical on the
basis of earlier work showing that classical reliabilities increase as
a function of the number of raters in accordance with the
Spearman-Brown prophecy formula.
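The Spearman-Brown relation mentioned above can be sketched briefly. The code shows only the classical formula and its inversion for a target reliability; it is not the nested-design computation of Kane et al. (1976), and the single-rater reliability used in the example is a hypothetical value.

```python
import math


def spearman_brown(r_single, k):
    """Reliability of the mean of k parallel raters, given one rater's reliability."""
    return k * r_single / (1 + (k - 1) * r_single)


def raters_needed(r_single, target):
    """Smallest whole number of raters whose pooled reliability reaches `target`."""
    k = target * (1 - r_single) / (r_single * (1 - target))
    # The small tolerance guards against floating-point overshoot at exact integers.
    return max(1, math.ceil(k - 1e-9))


if __name__ == "__main__":
    # Hypothetical single-rater reliability of .20 and a target of .70.
    k = raters_needed(0.20, 0.70)
    print(k, spearman_brown(0.20, k))
```

The formula assumes that raters are interchangeable (parallel) and that only one source of error is at work, which is exactly the simplification that a generalizability analysis with separate rater and item facets relaxes.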

Several  questions  might  be  asked  that  this  generalizability  design 
would  not  answer.  Differences  in  the  ratings  of  trainees  are  confounded 
with  conditions.  If  one  trainee  is  placed  in  a night  condition  and 
another  in  a day  condition,  differences  in  the  ratings  might  be  due 
either to differences in the trainees or to differences in the
conditions. In the school example, this is a confounding of teacher and
course effects.

It is possible to design a study which would separate out the
trainee effect from the effect of conditions. Perhaps it might be
useful to know the variety of conditions over which performance of
the trainee is expected to generalize. The obvious solution is to
have a number of trainees rated in each of several conditions. Such
a completely crossed design may often not be feasible in practice. If
not, one might design the two studies pictured in Figure 3. In the
first of these, conditions are nested within trainees, thereby
providing an estimate of the variance component for the main effect
due to trainees, and there is also an estimate of the variance component
for the condition main effect confounded with the condition by trainee
interaction. One can estimate the generalizability of the average
rating of the trainee over conditions from this study. A different study,
nesting trainees under condition, will give the variance component due
to the main effect of conditions and an independent estimate of the
variance component for the trainee main effect confounded with the
trainee by condition interaction. This study will give an estimate
of the generalizability or dependability of the average rating of a
condition, generalizing over trainees. Computational details for
this design applied in an educational setting have been reported by
Gillmore, Kane, and Naccarato (1978).

For the first study in Figure 3, there are $2^3 = 8$ possible
combinations of facets in the universe of admissible observations. We
can generalize or not generalize over conditions, raters, and items when
the object of measurement is the trainee. One can determine how the
generalizability coefficients will change when the number of conditions
or raters or items is changed. The logic involved in analyzing the
second of these studies is similar.
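The eight combinations can be listed mechanically, which may help in checking that no universe of generalization has been overlooked. This is only an enumeration sketch; the facet labels are taken from the text and the function name is hypothetical.

```python
from itertools import combinations

FACETS = ("conditions", "raters", "items")


def universes_of_generalization(facets=FACETS):
    """List every subset of facets over which one might choose to generalize."""
    subsets = []
    for size in range(len(facets) + 1):
        subsets.extend(combinations(facets, size))
    return subsets


if __name__ == "__main__":
    for subset in universes_of_generalization():
        label = ", ".join(subset) if subset else "(no facet; fixed conditions)"
        print("generalize over:", label)
```

Each subset corresponds to a different generalizability coefficient with the trainee as the object of measurement; the empty subset treats all three facets as fixed.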

Instead of conditions (night vs. day), the condition facet might
have included different field conditions ranging from highly supportive,
as in the training situation, to exceedingly hostile, as in realistic
maneuvers or combat. A different facet might be region or part of the
country or theater of operation, another one might be the nature of
terrain, still another might be organizational characteristics, such
as structure or "climate." In designing a generalizability study,
one basically needs to identify precisely what it is that is being
measured (for example, peer ratings of trainees) and the facets to
which that measurement needs to be generalized. Brennan (1977)
clearly covers specific details for five common designs. Cronbach
et al. (1972) identify a broader range of possible designs.

SUMMARY 

The term generalizability is ambiguous, and examining its meanings
leads to interesting and important research on the evaluation of work
sample tests. Three meanings of the term have been identified:
Generalizability  of  content  to  a broader  domain,  generalizability  of 
relationships  as  in  criterion-related  validities,  and  generalizability 
of  scores  over  conditions  of  measurement. 

Cronbach and his associates have explicitly excluded traditional
concerns for criterion-related validity from their notion of
generalizability theory. A form of construct validity is demonstrated
when a measure such as a work sample, which is defined primarily as a
sample of job observations, is found to generalize over important
varieties of conditions.

A consideration of the Schmidt and Hunter (1977) work suggests
a further possibility for applying generalizability theory to the
determination of the limits of generalizability in validity
generalization studies. Validity generalization studies such as those
conducted by Schmidt and Hunter involve the accumulation of very large
distributions of validity coefficients. These validity coefficients
(with the proper transformation to avoid the biased sampling
distribution) can be treated as scores. The objects of measurement are
not persons but studies. Facets describing the various kinds of
conditions in which validation studies are performed can be identified.
An obvious example is the distinction between predictive and concurrent
studies. A less obvious distinction might call for information about
the subject populations and their degree of urbanization, or it might
call for information about structural characteristics of the
organizations. A question might arise over the generalizability
of a particular validity coefficient over different industries, so that
the industry of choice might be a facet. Generalizability coefficients
can be computed for generalizability over various facets of interest
and, by this technique, potential limits to the generalizability of
a validity coefficient can be obtained. The issue of the limits of
generalizability seems to be a very important one. Proper use of
generalizability theory designs can lead to defined limits of the
generalizability of work sample scores and of the generalizability of
criterion-related validity coefficients.
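One way the proposal might be tried is sketched below, under stated assumptions. The "proper transformation" is taken to be Fisher's r-to-z, which the text does not name; studies are treated as nested within the conditions of a single facet (for example, predictive versus concurrent), with an equal number of studies in each condition; and the variance components come from an ordinary one-way random-effects analysis. The function names and any coefficients supplied are hypothetical.

```python
import math
from statistics import mean


def fisher_z(r):
    """Fisher's r-to-z transformation (assumed here to be the intended one)."""
    return math.atanh(r)


def facet_variance_components(groups):
    """Variance components for studies nested within conditions of one facet.

    `groups` maps each condition to a list of validity coefficients; every
    condition must contribute the same number of studies (at least two).
    Returns (var_condition, var_within), both on the z scale.
    """
    z = {name: [fisher_z(r) for r in rs] for name, rs in groups.items()}
    sizes = {len(values) for values in z.values()}
    if len(sizes) != 1:
        raise ValueError("this sketch requires a balanced design")
    n = sizes.pop()
    k = len(z)
    group_means = {name: mean(values) for name, values in z.items()}
    grand = mean(list(group_means.values()))

    ms_between = n * sum((m - grand) ** 2 for m in group_means.values()) / (k - 1)
    ms_within = sum(
        (x - group_means[name]) ** 2 for name, values in z.items() for x in values
    ) / (k * (n - 1))

    var_condition = max((ms_between - ms_within) / n, 0.0)
    return var_condition, ms_within


if __name__ == "__main__":
    # Hypothetical validity coefficients from six studies.
    studies = {
        "predictive": [0.22, 0.30, 0.18],
        "concurrent": [0.35, 0.41, 0.29],
    }
    print(facet_variance_components(studies))
```

A condition component near zero would be consistent with generalizing across that facet, whereas a component that is large relative to the within-condition variance would mark the facet as a possible limit. The within-condition term still contains ordinary sampling error, so it should not be read as situational variance.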